{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nYDA72xl0a5e"
      },
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Trusted-AI/AIF360/blob/main/examples/sklearn/demo_mdss_classifier_metric_sklearn.ipynb)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lXh4FWRp0aD8"
      },
      "source": [
        "## Bias scan using Multi-Dimensional Subset Scan (MDSS)\n",
        "\n",
        "\"Identifying Significant Predictive Bias in Classifiers\" https://arxiv.org/abs/1611.08292\n",
        "\n",
        "The goal of bias scan is to identify a subgroup(s) that has significantly more predictive bias than would be expected from an unbiased classifier. There are $\\prod_{m=1}^{M}\\left(2^{|X_{m}|}-1\\right)$ unique subgroups from a dataset with $M$ features, with each feature having $|X_{m}|$ discretized values, where a subgroup is any $M$-dimension\n",
        "Cartesian set product, between subsets of feature-values from each feature --- excluding the empty set. Bias scan mitigates this computational hurdle by approximately identifing the most statistically biased subgroup in linear time (rather than exponential).\n",
        "\n",
        "\n",
        "We define the statistical measure of predictive bias function, $score_{bias}(S)$ as a likelihood ratio score and a function of a given subgroup $S$. The null hypothesis is that the given prediction's odds are correct for all subgroups in $\\mathcal{D}$:\n",
        "\n",
        "$$H_{0}:odds(y_{i})=\\frac{\\hat{p}_{i}}{1-\\hat{p}_{i}}\\ \\forall i\\in\\mathcal{D}.$$\n",
        "\n",
        "The alternative hypothesis assumes some constant multiplicative bias in the odds for some given subgroup $S$:\n",
        "\n",
        "$$H_{1}:\\ odds(y_{i})=q\\frac{\\hat{p}_{i}}{1-\\hat{p}_{i}},\\ \\text{where}\\ q>1\\ \\forall i\\in S\\ \\mathrm{and}\\ q=1\\ \\forall i\\notin S.$$\n",
        "\n",
        "In the classification setting, each observation's likelihood is Bernoulli distributed and assumed independent. This results in the following scoring function for a subgroup $S$:\n",
        "\n",
        "\\begin{align*}\n",
        "score_{bias}(S)= & \\max_{q}\\log\\prod_{i\\in S}\\frac{Bernoulli(\\frac{q\\hat{p}_{i}}{1-\\hat{p}_{i}+q\\hat{p}_{i}})}{Bernoulli(\\hat{p}_{i})}\\\\\n",
        "= & \\max_{q}\\log(q)\\sum_{i\\in S}y_{i}-\\sum_{i\\in S}\\log(1-\\hat{p}_{i}+q\\hat{p}_{i}).\n",
        "\\end{align*}\n",
        "Our bias scan is thus represented as: $S^{*}=FSS(\\mathcal{D},\\mathcal{E},F_{score})=MDSS(\\mathcal{D},\\hat{p},score_{bias})$.\n",
        "\n",
        "where $S^{*}$ is the detected most anomalous subgroup, $FSS$ is one of several subset scan algorithms for different problem settings, $\\mathcal{D}$ is a dataset with outcomes $Y$ and discretized features $\\mathcal{X}$, $\\mathcal{E}$ are a set of expectations or 'normal' values for $Y$, and $F_{score}$ is an expectation-based scoring statistic that measures the amount of anomalousness between subgroup observations and their expectations.\n",
        "\n",
        "Predictive bias emphasizes comparable predictions for a subgroup and its observations and Bias scan provides a more general method that can detect and characterize such bias, or poor classifier fit, in the larger space of all possible subgroups, without a priori specification."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yVEUBtir04X9"
      },
      "outputs": [],
      "source": [
        "# Install AIF360\n",
        "!pip install 'aif360'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "2oasn4AW0aED"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "from sklearn.model_selection import train_test_split\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "\n",
        "from aif360.sklearn.datasets import fetch_compas\n",
        "from aif360.sklearn.detectors import bias_scan\n",
        "from aif360.sklearn.metrics import mdss_bias_score"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sLL3T1sU0aEH"
      },
      "source": [
        "We'll demonstrate scoring a subset and finding the most anomalous subset with bias scan using the compas dataset.\n",
        "\n",
        "We can specify subgroups to be scored or scan for the most anomalous subgroup. Bias scan allows us to decide if we aim to identify bias as observing **lower** than predicted probabilities of recidivism, i.e. overestimation, (unprivileged) or observing **higher** than predicted probabilities, i.e. underestimation, (privileged).\n",
        "\n",
        "Note: categorical features must not be one-hot encoded since scanning those features may find subgroups that are not meaningful e.g., a subgroup with 2 race values."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "FGRJcJP80aEH"
      },
      "outputs": [],
      "source": [
        "cols = ['sex', 'race', 'age_cat', 'priors_count', 'c_charge_degree']\n",
        "X, y = fetch_compas(usecols=cols, binary_race=True)\n",
        "\n",
        "# Quantize priors count between 0, 1-3, and >3\n",
        "X['priors_count'] = pd.cut(X['priors_count'], [-1, 0, 3, 100],\n",
        "                           labels=['0', '1 to 3', 'More than 3'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qe5eUWVE0aEI"
      },
      "source": [
        "### Training\n",
        "We'll split the dataset and then train a simple classifier to predict the probability of the outcome; (0: Survived, 1: Recidivated)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "Qz-mlm9Y0aEJ",
        "outputId": "469c0fbd-8a56-4eb5-99bf-2e401726d59c"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th>sex</th>\n",
              "      <th>race</th>\n",
              "      <th>age_cat</th>\n",
              "      <th>priors_count</th>\n",
              "      <th>c_charge_degree</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>sex</th>\n",
              "      <th>race</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th rowspan=\"5\" valign=\"top\">Male</th>\n",
              "      <th>African-American</th>\n",
              "      <td>Male</td>\n",
              "      <td>African-American</td>\n",
              "      <td>Less than 25</td>\n",
              "      <td>0</td>\n",
              "      <td>M</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>African-American</th>\n",
              "      <td>Male</td>\n",
              "      <td>African-American</td>\n",
              "      <td>Greater than 45</td>\n",
              "      <td>1 to 3</td>\n",
              "      <td>M</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>Caucasian</th>\n",
              "      <td>Male</td>\n",
              "      <td>Caucasian</td>\n",
              "      <td>25 - 45</td>\n",
              "      <td>More than 3</td>\n",
              "      <td>F</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>African-American</th>\n",
              "      <td>Male</td>\n",
              "      <td>African-American</td>\n",
              "      <td>Less than 25</td>\n",
              "      <td>1 to 3</td>\n",
              "      <td>F</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>African-American</th>\n",
              "      <td>Male</td>\n",
              "      <td>African-American</td>\n",
              "      <td>25 - 45</td>\n",
              "      <td>1 to 3</td>\n",
              "      <td>F</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                        sex              race          age_cat priors_count  \\\n",
              "sex  race                                                                     \n",
              "Male African-American  Male  African-American     Less than 25            0   \n",
              "     African-American  Male  African-American  Greater than 45       1 to 3   \n",
              "     Caucasian         Male         Caucasian          25 - 45  More than 3   \n",
              "     African-American  Male  African-American     Less than 25       1 to 3   \n",
              "     African-American  Male  African-American          25 - 45       1 to 3   \n",
              "\n",
              "                      c_charge_degree  \n",
              "sex  race                              \n",
              "Male African-American               M  \n",
              "     African-American               M  \n",
              "     Caucasian                      F  \n",
              "     African-American               F  \n",
              "     African-American               F  "
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "(X_test, X_train,\n",
        " y_test, y_train) = train_test_split(X, y, test_size=3694, shuffle=True, random_state=0)\n",
        "\n",
        "X_train.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "Cf2FVBPx0aEM",
        "outputId": "e3816c51-94af-4086-ec6a-9ef641481e78"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "array(['Recidivated', 'Survived'], dtype=object)"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "clf = LogisticRegression(solver='lbfgs', C=1.0, penalty='l2', random_state=0)\n",
        "clf.fit(X_train.apply(lambda s: s.cat.codes), y_train)\n",
        "clf.classes_"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cMBVAebG0aEO"
      },
      "source": [
        "predictions should reflect the probability of a favorable outcome (i.e. no recidivism)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "lXPEpsmt0aEP"
      },
      "outputs": [],
      "source": [
        "test_prob = clf.predict_proba(X_test.apply(lambda s: s.cat.codes))[:, 1]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "s_gzSzvf0aEQ",
        "outputId": "15b10052-2da0-4502-8385-16f3105684c2"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>sex</th>\n",
              "      <th>race</th>\n",
              "      <th>age_cat</th>\n",
              "      <th>priors_count</th>\n",
              "      <th>c_charge_degree</th>\n",
              "      <th>observed</th>\n",
              "      <th>probabilities</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Female</td>\n",
              "      <td>Caucasian</td>\n",
              "      <td>Greater than 45</td>\n",
              "      <td>More than 3</td>\n",
              "      <td>F</td>\n",
              "      <td>Recidivated</td>\n",
              "      <td>0.552951</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Female</td>\n",
              "      <td>African-American</td>\n",
              "      <td>25 - 45</td>\n",
              "      <td>0</td>\n",
              "      <td>F</td>\n",
              "      <td>Survived</td>\n",
              "      <td>0.740959</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Male</td>\n",
              "      <td>Caucasian</td>\n",
              "      <td>Less than 25</td>\n",
              "      <td>1 to 3</td>\n",
              "      <td>F</td>\n",
              "      <td>Survived</td>\n",
              "      <td>0.374728</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Male</td>\n",
              "      <td>African-American</td>\n",
              "      <td>Greater than 45</td>\n",
              "      <td>More than 3</td>\n",
              "      <td>F</td>\n",
              "      <td>Recidivated</td>\n",
              "      <td>0.444487</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Male</td>\n",
              "      <td>Caucasian</td>\n",
              "      <td>25 - 45</td>\n",
              "      <td>1 to 3</td>\n",
              "      <td>M</td>\n",
              "      <td>Recidivated</td>\n",
              "      <td>0.584908</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "      sex              race          age_cat priors_count c_charge_degree  \\\n",
              "0  Female         Caucasian  Greater than 45  More than 3               F   \n",
              "1  Female  African-American          25 - 45            0               F   \n",
              "2    Male         Caucasian     Less than 25       1 to 3               F   \n",
              "3    Male  African-American  Greater than 45  More than 3               F   \n",
              "4    Male         Caucasian          25 - 45       1 to 3               M   \n",
              "\n",
              "      observed  probabilities  \n",
              "0  Recidivated       0.552951  \n",
              "1     Survived       0.740959  \n",
              "2     Survived       0.374728  \n",
              "3  Recidivated       0.444487  \n",
              "4  Recidivated       0.584908  "
            ]
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df = X_test.copy()\n",
        "df['observed'] = y_test\n",
        "df['probabilities'] = test_prob\n",
        "df.reset_index(drop=True).head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RsieqjLM0aES"
      },
      "source": [
        "In this example, we assume that the model makes systematic under or over estimatations of the recidivism risk for certain subgroups and our aim is to identify these subgroups"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "enf68WyK0aES"
      },
      "source": [
        "### Bias scoring\n",
        "\n",
        "We'll call the bias scoring function and score the test set. The `privileged` argument indicates the direction for which to scan for bias depending on the positive label. In our case since the positive label is 0 ('Survived'), `True` corresponds to checking for underestimated risk of recidivism and `False` corresponds to checking for overestimated risk of recidivism."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "ZN1YkR6x0aET",
        "outputId": "890421fa-0778-48e7-8f08-a2ffce372572"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "-0.0\n",
            "-0.0\n"
          ]
        }
      ],
      "source": [
        "print(mdss_bias_score(df['observed'], df['probabilities'], pos_label='Survived',\n",
        "                      X=df.iloc[:, :-2], subset={'sex': ['Female']},\n",
        "                      overpredicted=True))\n",
        "print(mdss_bias_score(df['observed'], df['probabilities'], pos_label='Survived',\n",
        "                      X=df.iloc[:, :-2], subset={'sex': ['Male']},\n",
        "                      overpredicted=False))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "id": "r5W7-R-n0aEU",
        "outputId": "d140e34f-009f-4537-cb2a-544b1147d255"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "0.63\n",
            "1.1769\n"
          ]
        }
      ],
      "source": [
        "print(mdss_bias_score(df['observed'], df['probabilities'], pos_label='Survived',\n",
        "                      X=df.iloc[:, :-2], subset={'sex': ['Male']},\n",
        "                      overpredicted=True))\n",
        "print(mdss_bias_score(df['observed'], df['probabilities'], pos_label='Survived',\n",
        "                      X=df.iloc[:, :-2], subset={'sex': ['Female']},\n",
        "                      overpredicted=False))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IpbHyIbn0aEV"
      },
      "source": [
        "If we assume correctly, then our bias score is going to be higher; thus whichever of the assumptions results in a higher bias score has the most evidence of being true. This means males are likely privileged whereas females are likely unpriviledged by our classifier."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t3cTeDV60aEV"
      },
      "source": [
        "### Bias scan\n",
        "We get the bias score for the apriori defined subgroup but assuming we had no prior knowledge \n",
        "about the predictive bias and wanted to find the subgroups with the most bias, we can apply bias scan to identify the priviledged and unpriviledged groups. The privileged argument is not a reference to a group but the direction for which to scan for bias."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "id": "bmxiLwWH0aEW",
        "outputId": "c1a5b76a-05b8-441a-a6fc-7219e48a298b"
      },
      "outputs": [],
      "source": [
        "privileged_subset = bias_scan(df[df.columns[:-2]], df['observed'], df['probabilities'],\n",
        "                              pos_label='Survived', penalty=0.5, overpredicted=True)\n",
        "unprivileged_subset = bias_scan(df[df.columns[:-2]], df['observed'], df['probabilities'],\n",
        "                                pos_label='Survived', penalty=0.5, overpredicted=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "BQVzouil0aEW",
        "outputId": "c10bf3fc-11f6-4d35-ca2d-a5b94ab54035"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "({'race': ['African-American'], 'age_cat': ['Less than 25'], 'sex': ['Male']}, 3.1526)\n",
            "({'sex': ['Female'], 'race': ['African-American']}, 3.3036)\n"
          ]
        }
      ],
      "source": [
        "print(privileged_subset)\n",
        "print(unprivileged_subset)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "7VOBEZQF0aEX"
      },
      "outputs": [],
      "source": [
        "assert privileged_subset[0]\n",
        "assert unprivileged_subset[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LJ50_G_z0aEX"
      },
      "source": [
        "We can observe that the bias score is higher than the score of the prior groups. These subgroups are guaranteed to be the highest scoring subgroup among the exponentially many subgroups.\n",
        "\n",
        "For the purposes of this example, the logistic regression model systematically under estimates the recidivism risk of individuals in the `African-American`, `Less than 25`, `Male` subgroup whereas individuals belonging to the `African-American`, `Female` subgroup are assigned a higher risk than is actually observed. We refer to these subgroups as the `detected privileged group` and `detected unprivileged group` respectively."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OJoMsQti0aEX"
      },
      "source": [
        "As noted in the paper, predictive bias is different from predictive fairness so there's no the emphasis in the subgroups having comparable predictions between them. \n",
        "We can investigate the difference in what the model predicts vs what we actually observed as well as the multiplicative difference in the odds of the subgroups."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "8xR8re-50aEX"
      },
      "outputs": [],
      "source": [
        "to_choose = df[privileged_subset[0].keys()].isin(privileged_subset[0]).all(axis=1)\n",
        "temp_df = df.loc[to_choose]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "id": "rPHME8hD0aEY",
        "outputId": "8df140b3-8c86-4c36-d2f9-61d46db861ed",
        "scrolled": true
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "'Our detected priviledged group has a size of 192, our model predicts 57.30% probability of recidivism but we observe 67.71% as the mean outcome'"
            ]
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "group_obs = temp_df['observed'].cat.codes.mean()\n",
        "group_prob = 1-temp_df['probabilities'].mean()\n",
        "\n",
        "\"Our detected priviledged group has a size of {}, our model predicts {:.2%} probability of recidivism but we observe {:.2%} as the mean outcome\"\\\n",
        ".format(len(temp_df), group_prob, group_obs)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {
        "id": "ELJWYT6h0aEY",
        "outputId": "59d16cf4-805d-42da-9be1-a5a409571426"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "'This is a multiplicative increase in the odds by 1.562'"
            ]
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "odds_mul = (group_obs / (1 - group_obs)) / (group_prob /(1 - group_prob))\n",
        "\"This is a multiplicative increase in the odds by {:.3f}\".format(odds_mul)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "id": "HgbAKAXr0aEY"
      },
      "outputs": [],
      "source": [
        "assert odds_mul > 1"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "id": "Kqo_xwTq0aEY"
      },
      "outputs": [],
      "source": [
        "to_choose = df[unprivileged_subset[0].keys()].isin(unprivileged_subset[0]).all(axis=1)\n",
        "temp_df = df.loc[to_choose]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {
        "id": "tzUpqGoS0aEZ",
        "outputId": "fbaf1373-dbd7-4adf-b4b5-08e016ad6a19"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "'Our detected unpriviledged group has a size of 169, our model predicts 43.65% probability of recidivism but we observe 33.14% as the mean outcome'"
            ]
          },
          "execution_count": 17,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "group_obs = temp_df['observed'].cat.codes.mean()\n",
        "group_prob = 1-temp_df['probabilities'].mean()\n",
        "\n",
        "\"Our detected unpriviledged group has a size of {}, our model predicts {:.2%} probability of recidivism but we observe {:.2%} as the mean outcome\"\\\n",
        ".format(len(temp_df), group_prob, group_obs)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {
        "id": "abXJJzI70aEZ",
        "outputId": "79487689-714c-42ee-ec25-ce1ab098fe95"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "'This is a multiplicative decrease in the odds by 0.640'"
            ]
          },
          "execution_count": 18,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "odds_mul = (group_obs / (1 - group_obs)) / (group_prob /(1 - group_prob))\n",
        "\"This is a multiplicative decrease in the odds by {:.3f}\".format(odds_mul)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {
        "id": "E1SgN9480aEZ"
      },
      "outputs": [],
      "source": [
        "assert odds_mul < 1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TopRFOw80aEZ"
      },
      "source": [
        "In summary this notebook demonstrates the use of bias scan to identify subgroups with significant predictive bias, as quantified by a likelihood ratio score, using subset scanning. This allows consideration of not just subgroups of a priori interest or small dimensions, but the space of all possible subgroups of features.\n",
        "It also presents opportunity for a kind of bias mitigation technique that uses the multiplicative odds in the over-or-under estimated subgroups to adjust for predictive fairness."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3.9.7 ('aif360')",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.7"
    },
    "vscode": {
      "interpreter": {
        "hash": "d0c5ced7753e77a483fec8ff7063075635521cce6e0bd54998c8f174742209dd"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
